Summary of dataset

White Wine Quality is a tidy data set containing 4,898 white wines, with 11 variables quantifying the chemical properties of each wine. At least 3 wine experts rated the quality of each wine on a scale from 0 (very bad) to 10 (very excellent).

Load the libraries

library(ggplot2)
library(dplyr)
library(gridExtra)
library(knitr)


Load the dataset

# Use the first CSV column as row names
Whitewine = read.csv('wineQualityWhites.csv', header = TRUE, row.names = 1)

Exploring the dataset

names(Whitewine)
##  [1] "fixed.acidity"        "volatile.acidity"     "citric.acid"         
##  [4] "residual.sugar"       "chlorides"            "free.sulfur.dioxide" 
##  [7] "total.sulfur.dioxide" "density"              "pH"                  
## [10] "sulphates"            "alcohol"              "quality"

All the variables are numeric; there are no factor columns in the dataset.
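
A quick way to confirm this is to test each column's type with `sapply()`. The sketch below runs on a hypothetical three-row stand-in (`toy`) rather than the real `Whitewine` data frame, so the values are illustrative only:

```r
# Hypothetical stand-in for Whitewine; the values are illustrative only.
toy <- data.frame(
  fixed.acidity = c(7.0, 6.3, 8.1),
  alcohol       = c(8.8, 9.5, 10.1),
  quality       = c(6L, 6L, 5L)
)

# TRUE for every column that holds numbers (integer or double).
numeric_cols <- sapply(toy, is.numeric)
all_numeric  <- all(numeric_cols)

# Count of factor columns; expected to be zero here.
n_factors <- sum(sapply(toy, is.factor))

print(numeric_cols)
```

On the real data the same two `sapply()` calls would be applied to `Whitewine`; `str(Whitewine)` shows the same information in a single view.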

summary(Whitewine)
##  fixed.acidity    volatile.acidity  citric.acid     residual.sugar  
##  Min.   : 3.800   Min.   :0.0800   Min.   :0.0000   Min.   : 0.600  
##  1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700   1st Qu.: 1.700  
##  Median : 6.800   Median :0.2600   Median :0.3200   Median : 5.200  
##  Mean   : 6.855   Mean   :0.2782   Mean   :0.3342   Mean   : 6.391  
##  3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900   3rd Qu.: 9.900  
##  Max.   :14.200   Max.   :1.1000   Max.   :1.6600   Max.   :65.800  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.00900   Min.   :  2.00      Min.   :  9.0       
##  1st Qu.:0.03600   1st Qu.: 23.00      1st Qu.:108.0       
##  Median :0.04300   Median : 34.00      Median :134.0       
##  Mean   :0.04577   Mean   : 35.31      Mean   :138.4       
##  3rd Qu.:0.05000   3rd Qu.: 46.00      3rd Qu.:167.0       
##  Max.   :0.34600   Max.   :289.00      Max.   :440.0       
##     density             pH          sulphates         alcohol     
##  Min.   :0.9871   Min.   :2.720   Min.   :0.2200   Min.   : 8.00  
##  1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100   1st Qu.: 9.50  
##  Median :0.9937   Median :3.180   Median :0.4700   Median :10.40  
##  Mean   :0.9940   Mean   :3.188   Mean   :0.4898   Mean   :10.51  
##  3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500   3rd Qu.:11.40  
##  Max.   :1.0390   Max.   :3.820   Max.   :1.0800   Max.   :14.20  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.878  
##  3rd Qu.:6.000  
##  Max.   :9.000

Quality values range from 3 to 9. The median (6) and mean (5.878) are very close to each other, which suggests the distribution is not strongly skewed.
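
Comparing mean and median is a quick heuristic rather than a formal skewness measure. As a minimal sketch, the hypothetical helper `mean_median_gap()` below scales the gap by the standard deviation and is tried on two made-up vectors (neither is taken from the wine data):

```r
# Gap between mean and median, scaled by the standard deviation,
# as a rough indicator of skew (near 0 suggests symmetry).
mean_median_gap <- function(x) {
  (mean(x) - median(x)) / sd(x)
}

# Hypothetical vectors, not taken from the wine data.
symmetric    <- c(4, 5, 6, 6, 6, 7, 8)
right_skewed <- c(1, 1, 2, 2, 3, 20)

gap_symmetric <- mean_median_gap(symmetric)
gap_skewed    <- mean_median_gap(right_skewed)

print(c(symmetric = gap_symmetric, right_skewed = gap_skewed))
```

A value close to zero is consistent with a symmetric distribution, while a clearly positive value points to a longer right tail.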


Univariate Plots

ggplot(aes(x = quality), data = Whitewine) + geom_bar() + 
  scale_y_continuous(breaks = seq(0,2250,250)) + 
  scale_x_continuous(limits = c(3,10), breaks = seq(3,9,1))

The plot shows that the most common quality score is 6. Very few wines have a quality score of 9.

Let’s explore the other variables in the dataset and plot their distributions.

1)Fixed.acidity

grid.arrange(ggplot(aes(x = fixed.acidity), data = Whitewine) + 
               geom_histogram(),
             ggplot(aes(x = 1, y = fixed.acidity ), data = Whitewine) + 
               geom_jitter(alpha = 0.1 ) + 
               geom_boxplot(alpha = 0.2, color = 'red' ), ncol = 2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Outliers

ggplot(aes(x = fixed.acidity), data = Whitewine) + 
  geom_histogram(binwidth = 0.1) + 
  scale_x_continuous(breaks = seq(0,15,1))
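
The `binwidth = 0.1` used here can be compared with a rule of thumb. The sketch below applies the Freedman-Diaconis rule, which the original analysis does not use and which is brought in only as a reference point. It takes the fixed.acidity quartiles and the row count from the summary output; the helper name `fd_binwidth` is hypothetical:

```r
# Freedman-Diaconis rule: binwidth = 2 * IQR / n^(1/3).
fd_binwidth <- function(q1, q3, n) {
  2 * (q3 - q1) / n^(1/3)
}

# Quartiles of fixed.acidity and the number of wines, as reported in this document.
bw <- fd_binwidth(q1 = 6.3, q3 = 7.3, n = 4898)

print(round(bw, 3))
```

The rule suggests a width of roughly 0.12, the same order of magnitude as the 0.1 chosen by hand.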

The distribution of fixed acidity is very close to a normal distribution, but there are some outliers in the data.

summary(Whitewine$fixed.acidity)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.800   6.300   6.800   6.855   7.300  14.200

The summary table and the boxplot show that the upper quarter of the data stretches from 7.3 (3rd quartile) to 14.2 (maximum), with only a few points at the high end.

After trimming the bottom and top 1 percent, the graph below looks close to normal.

ggplot(aes(x = fixed.acidity), data = Whitewine) + 
  geom_histogram(binwidth = 0.1) + 
  scale_x_continuous(breaks = seq(0,15,1), 
                     limits = c(quantile(Whitewine$fixed.acidity, 0.01) ,
                                quantile(Whitewine$fixed.acidity, 0.99)))
## Warning: Removed 75 rows containing non-finite values (stat_bin).
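
The warning above appears because values outside the scale limits are dropped before binning; the data frame itself is unchanged. A minimal sketch of doing the same kind of trim explicitly, using a hypothetical helper `trim_quantiles()` and a made-up vector instead of the wine data:

```r
# Hypothetical helper: keep only values between two quantiles of x.
trim_quantiles <- function(x, lower = 0, upper = 0.99) {
  bounds <- quantile(x, probs = c(lower, upper), names = FALSE)
  x[x >= bounds[1] & x <= bounds[2]]
}

# Made-up data: 99 ordinary values and one extreme value.
x <- c(1:99, 1000)

kept    <- trim_quantiles(x)          # drops everything above the 99th percentile
removed <- length(x) - length(kept)   # number of values trimmed away

print(c(kept = length(kept), removed = removed))
```

Filtering first makes the number of dropped rows explicit and keeps the trimmed vector available for summaries, whereas scale limits only affect what the plot shows.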

2)Volatile.acidity

grid.arrange(ggplot(aes(x = volatile.acidity), data = Whitewine) + 
               geom_histogram(),
             ggplot(aes(x = 1, y = volatile.acidity ), data = Whitewine) + 
               geom_jitter(alpha = 0.1 ) + 
               geom_boxplot(alpha = 0.2, color = 'red' ), ncol = 2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(aes(x = volatile.acidity), data = Whitewine) + 
  geom_histogram(binwidth = 0.01) + 
  scale_x_continuous(breaks = seq(0,1.1,0.1)) 

summary(Whitewine$volatile.acidity)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0800  0.2100  0.2600  0.2782  0.3200  1.1000

The dataset contains some extreme points that skew the distribution. Let's trim the top 1 percent; the graph below is obtained:

ggplot(aes(x = volatile.acidity), data = Whitewine) + 
  geom_histogram(binwidth = 0.01) + 
  scale_x_continuous(breaks = seq(0.1,1.1,0.1), 
                     limits = c(0,quantile(Whitewine$volatile.acidity, 0.99)))
## Warning: Removed 48 rows containing non-finite values (stat_bin).

3)Citric.acid

grid.arrange(ggplot(aes(x = citric.acid), data = Whitewine) + 
               geom_histogram(),
             ggplot(aes(x = 1, y = citric.acid ), data = Whitewine) + 
               geom_jitter(alpha = 0.1 ) + 
               geom_boxplot(alpha = 0.2, color = 'red' ), ncol = 2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(aes(x = citric.acid), data = Whitewine) + 
  geom_histogram(binwidth = 0.01) + 
  scale_x_continuous(breaks = seq(0.1,1.5,0.1)) 

The citric acid feature has many high and low outlier values, so trimming them should give a cleaner distribution. Extra attention should be given to the spike near 0.5.

summary(Whitewine$citric.acid)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.2700  0.3200  0.3342  0.3900  1.6600

Citric acid also has some extremely high values. Let's omit the top 1 percent to get a distribution that is closer to normal:

ggplot(aes(x = citric.acid), data = Whitewine) + 
  geom_histogram(binwidth = 0.01) + 
  scale_x_continuous(breaks = seq(0.1,1.5,0.1), 
                     limits = c(0,quantile(Whitewine$citric.acid, 0.99)))
## Warning: Removed 22 rows containing non-finite values (stat_bin).

table(Whitewine$citric.acid)
## 
##    0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09  0.1 0.11 0.12 0.13 0.14 
##   19    7    6    2   12    5    6   12    4   12   14    1   19   17   27 
## 0.15 0.16 0.17 0.18 0.19  0.2 0.21 0.22 0.23 0.24 0.25 0.26 0.27 0.28 0.29 
##   23   33   27   49   48   70   66  104   83  181  136  219  216  282  223 
##  0.3 0.31 0.32 0.33 0.34 0.35 0.36 0.37 0.38 0.39  0.4 0.41 0.42 0.43 0.44 
##  307  200  257  183  225  137  177  134  122  101  117   82   95   37   63 
## 0.45 0.46 0.47 0.48 0.49  0.5 0.51 0.52 0.53 0.54 0.55 0.56 0.57 0.58 0.59 
##   46   51   38   39  215   35   25   23   16   19   11   22   13   21    6 
##  0.6 0.61 0.62 0.63 0.64 0.65 0.66 0.67 0.68 0.69  0.7 0.71 0.72 0.73 0.74 
##    6    9   14    4    6    8    7    7    7    5    3    9    5    5   41 
## 0.78 0.79  0.8 0.81 0.82 0.86 0.88 0.91 0.99    1 1.23 1.66 
##    2    2    2    2    2    1    1    2    1    5    1    1
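
The spike near 0.5 can also be located programmatically. The sketch below rebuilds only three neighbouring values from the table above (0.48, 0.49 and 0.5, with their counts) and picks the most frequent one; on the real data the same calls would run on `Whitewine$citric.acid`. The variable names are hypothetical:

```r
# Counts for three neighbouring citric.acid values, copied from the table above.
citric_subset <- rep(c(0.48, 0.49, 0.50), times = c(39, 215, 35))

tab <- table(citric_subset)

# Value with the highest count, returned as a character label.
peak_value <- names(tab)[which.max(tab)]
peak_count <- max(tab)

print(peak_value)
print(peak_count)
```

The peak sits at 0.49, with 215 wines against 39 and 35 for its neighbours.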

4)Residual.Sugar

grid.arrange(ggplot(aes(x = residual.sugar), data = Whitewine) + 
               geom_histogram(),
             ggplot(aes(x = 1, y = residual.sugar ), data = Whitewine) + 
               geom_jitter(alpha = 0.1 ) + 
               geom_boxplot(alpha = 0.2, color = 'red' ), ncol = 2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The boxplot shows a few very high outlier points, so we trim them to get a clearer picture.

ggplot(aes(x = residual.sugar), data = Whitewine) + 
  geom_histogram(binwidth = 0.1)

summary(Whitewine$residual.sugar)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.600   1.700   5.200   6.391   9.900  65.800

The residual.sugar distribution is highly right-skewed. There are a few extremely high values but no low outliers.

After trimming, the graph below is obtained:

ggplot(aes(x = residual.sugar), data = Whitewine) + 
  geom_histogram(binwidth = 1, fill = '#5760AB') + 
  scale_x_continuous( limits = c(0.6, 
                                 quantile(Whitewine$residual.sugar, 0.99)), 
                      breaks = seq(0,50,1))
## Warning: Removed 47 rows containing non-finite values (stat_bin).
## Warning: Removed 1 rows containing missing values (geom_bar).

Here the distribution is multimodal, so wines with several distinct residual sugar levels exist: one group has very little residual sugar, one is sweeter (around 5), and another is sweeter still (approx. 8).

5)Chlorides

grid.arrange(ggplot(aes(x = chlorides), data = Whitewine) + 
               geom_histogram(color = 'Black',  fill = '#F79420'),
             ggplot(aes(x = 1, y = chlorides ), data = Whitewine) + 
               geom_jitter(alpha = 0.1 ) + 
               geom_boxplot(alpha = 0.2, color = 'red' ), ncol = 2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The box plot has a large number of outliers, which means the distribution is highly skewed. The histogram is hard to read with the default bins, so we should use narrower bins for better visualization.

summary(Whitewine$chlorides)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600
ggplot(aes(x = chlorides), data = Whitewine) + 
  geom_histogram(binwidth = 0.001, fill = '#5760AB')
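
Which points a boxplot draws as outliers follows from the quartiles: by default `geom_boxplot()` extends its whiskers to the most extreme points within 1.5 times the interquartile range of the box. A minimal sketch using the chlorides quartiles from the summary above (the helper name `boxplot_fences` is hypothetical):

```r
# Hypothetical helper: bounds beyond which a point is drawn as an outlier
# under the usual 1.5 * IQR rule.
boxplot_fences <- function(q1, q3, k = 1.5) {
  iqr <- q3 - q1
  c(lower = q1 - k * iqr, upper = q3 + k * iqr)
}

# 1st and 3rd quartiles of chlorides, as reported by summary().
fences <- boxplot_fences(q1 = 0.036, q3 = 0.050)
print(fences)

# The reported maximum (0.346) lies far above the upper fence.
max_is_outlier <- 0.346 > fences[["upper"]]
```

With fences at about 0.015 and 0.071, every chlorides value above roughly 0.07 is plotted as an outlier, which is consistent with the long upper tail of the histogram.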

The distribution looks reasonable, but the spread of the data is wide. We will omit the top 1% of the data for a clearer visualization.

ggplot(aes(x = chlorides), data = Whitewine) + 
  geom_histogram(binwidth = 0.001, fill = '#5760AB') + 
  scale_x_continuous( limits = c(0, quantile(Whitewine$chlorides, 0.99)))
## Warning: Removed 48 rows containing non-finite values (stat_bin).

We can see that most of the data is clustered around 0.05, with a considerable amount above that value. Values greater than 0.10 are sparse and widely spread.

table(Whitewine$chlorides)
## 
## 0.009 0.012 0.013 0.014 0.015 0.016 0.017 0.018 0.019  0.02 0.021 0.022 
##     1     1     1     4     4     5     5    10     9    16    19    19 
## 0.023 0.024 0.025 0.026 0.027 0.028 0.029  0.03 0.031 0.032 0.033 0.034 
##    20    34    30    54    58    85    81   108   107   109   119   168 
## 0.035 0.036 0.037 0.038 0.039  0.04 0.041 0.042 0.043 0.044 0.045 0.046 
##   130   200   160   167   157   182   147   184   141   201   170   181 
## 0.047 0.048 0.049  0.05 0.051 0.052 0.053 0.054 0.055 0.056 0.057 0.058 
##   171   174   133   170   115   104   130    99    61    88    68    53 
## 0.059  0.06 0.061 0.062 0.063 0.064 0.065 0.066 0.067 0.068 0.069  0.07 
##    36    46    19    25    23    15     8    18    18     7    18     6 
## 0.071 0.072 0.073 0.074 0.075 0.076 0.077 0.078 0.079  0.08 0.081 0.082 
##     5     2     5     8     2     9     1     2     4     4     2     2 
## 0.083 0.084 0.085 0.086 0.087 0.088 0.089  0.09 0.091 0.092 0.093 0.094 
##     5     5     3     4     3     2     1     2     1     3     3     5 
## 0.095 0.096 0.097 0.098 0.099 0.102 0.104 0.105 0.108  0.11 0.112 0.114 
##     2     6     1     3     1     1     1     1     2     3     1     1 
## 0.115 0.117 0.118 0.119  0.12 0.121 0.122 0.123 0.126 0.127  0.13 0.132 
##     1     3     1     3     1     2     1     4     3     2     1     1 
## 0.133 0.135 0.136 0.137 0.138 0.142 0.144 0.145 0.146 0.147 0.148 0.149 
##     1     1     1     2     2     3     1     1     1     2     1     1 
##  0.15 0.152 0.154 0.156 0.157 0.158  0.16 0.167 0.168 0.169  0.17 0.171 
##     1     2     1     1     4     1     2     2     3     2     2     1 
## 0.172 0.173 0.174 0.175 0.176 0.179  0.18 0.184 0.185 0.186 0.194 0.197 
##     2     2     2     2     2     1     1     2     2     1     1     2 
##   0.2 0.201 0.204 0.208 0.209 0.211 0.212 0.217 0.239  0.24 0.244 0.255 
##     1     2     1     2     1     1     1     1     1     1     1     1 
## 0.271  0.29 0.301 0.346 
##     1     1     1     1

6) Free.sulfur.dioxide

grid.arrange(ggplot(aes(x = free.sulfur.dioxide), data = Whitewine) + 
               geom_histogram(),
             ggplot(aes(x = 1, y = free.sulfur.dioxide ), data = Whitewine) + 
               geom_jitter(alpha = 0.1 ) + 
               geom_boxplot(alpha = 0.2, color = 'red' ), ncol = 2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

There are more outliers here than in most other features, so trimming should help the analysis. Let's first set a narrower binwidth for deeper insight.

ggplot(aes(x = free.sulfur.dioxide), data = Whitewine) + 
  geom_histogram(binwidth = 1)

Let's check the summary statistics:

summary(Whitewine$free.sulfur.dioxide)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00   23.00   34.00   35.31   46.00  289.00

As with the other variables, there are some extremely large values. If the top 1 percent is omitted:

ggplot(aes(x = free.sulfur.dioxide), data = Whitewine) + 
  geom_histogram(binwidth = 1, fill = '#5760AB') + 
  scale_x_continuous( limits = c(0, quantile(Whitewine$free.sulfur.dioxide, 0.99)))
## Warning: Removed 43 rows containing non-finite values (stat_bin).

This time the distribution looks much better and is close to normal. The skewness is also lower than before.

7)Total.sulfur.dioxide

grid.arrange(ggplot(aes(x = total.sulfur.dioxide), data = Whitewine) + 
               geom_histogram(),
             ggplot(aes(x = 1, y = total.sulfur.dioxide ), data = Whitewine) + 
               geom_jitter(alpha = 0.1 ) + 
               geom_boxplot(alpha = 0.2, color = 'red' ), ncol = 2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(Whitewine$total.sulfur.dioxide)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     9.0   108.0   134.0   138.4   167.0   440.0

The data is similar to the previous variables. There are again some extremely large values and a few outliers, but most of the data follows a bell-shaped, roughly normal distribution. Let's omit the top 1 percent; the distribution below is obtained:

ggplot(aes(x = total.sulfur.dioxide), data = Whitewine) + 
  geom_histogram(binwidth = 1, fill = '#5760AB') + 
  scale_x_continuous( limits = c(0, quantile(Whitewine$total.sulfur.dioxide, 0.99)))
## Warning: Removed 49 rows containing non-finite values (stat_bin).

We can see that most of the data lies between 50 and 240.

8)Density

grid.arrange(ggplot(aes(x = density), data = Whitewine) + 
               geom_histogram(),
             ggplot(aes(x = 1, y = density ), data = Whitewine) + 
               geom_jitter(alpha = 0.1 ) + 
               geom_boxplot(alpha = 0.2, color = 'red' ), ncol = 2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(Whitewine$density)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9917  0.9937  0.9940  0.9961  1.0390

The spread of density is very narrow, and there are almost no outliers. Let's use smaller bins:

ggplot(aes(x = density), data = Whitewine) + 
  geom_histogram(binwidth = 0.0001)

Most of the density values are between 0.98 and 1. Let's omit the top 1 percent:

ggplot(aes(x = density), data = Whitewine) + 
  geom_histogram(binwidth = 0.0001) + 
  scale_x_continuous( limits = c(0.9871, quantile(Whitewine$density, 0.99)))
## Warning: Removed 49 rows containing non-finite values (stat_bin).

Most of the density values are concentrated between 0.990 and 0.997.

9)pH

grid.arrange(ggplot(aes(x = pH), data = Whitewine) + 
               geom_histogram(),
             ggplot(aes(x = 1, y = pH ), data = Whitewine) + 
               geom_jitter(alpha = 0.1 ) + 
               geom_boxplot(alpha = 0.2, color = 'red' ), ncol = 2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The distribution is not skewed and looks bell-shaped, although there are some outliers. They occur on both sides, which keeps the distribution symmetric. With a narrower binwidth:

ggplot(aes(x = pH), data = Whitewine) + 
  geom_histogram(binwidth = 0.01)

summary(Whitewine$pH)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.720   3.090   3.180   3.188   3.280   3.820

The spread is quite narrow and the skewness is negligible.

10)Sulphates

grid.arrange(ggplot(aes(x = sulphates), data = Whitewine) + 
               geom_histogram(),
             ggplot(aes(x = 1, y =sulphates ), data = Whitewine) + 
               geom_jitter(alpha = 0.1 ) + 
               geom_boxplot(alpha = 0.2, color = 'red' ), ncol = 2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The distribution is slightly right-skewed, with a few outliers. Let's narrow the bin size:

ggplot(aes(x = sulphates), data = Whitewine) + 
  geom_histogram(binwidth = 0.01)

summary(Whitewine$sulphates)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2200  0.4100  0.4700  0.4898  0.5500  1.0800

After omitting top 1 percentile:

ggplot(aes(x = sulphates), data = Whitewine) + 
  geom_histogram(binwidth = 0.01) + 
  scale_x_continuous( limits = c(0.22, quantile(Whitewine$sulphates, 0.99)))
## Warning: Removed 48 rows containing non-finite values (stat_bin).

Some right skewness remains, but the distribution is very close to bell-shaped.

11) Alcohol

grid.arrange(ggplot(aes(x = alcohol), data = Whitewine) + 
               geom_histogram(),
             ggplot(aes(x = 1, y =alcohol ), data = Whitewine) + 
               geom_jitter(alpha = 0.1 ) + 
               geom_boxplot(alpha = 0.2, color = 'red' ), ncol = 2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(aes(x = alcohol), data = Whitewine) + 
  geom_histogram(binwidth = 0.1) + 
  scale_x_continuous(breaks = seq(8,14,1))

summary(Whitewine$alcohol)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.50   10.40   10.51   11.40   14.20

There are still some outliers. The distribution seems to be multimodal, with groups at (8.5-10), (10-11.5) and (11.5-13). The largest group is (8.5-10), and the most common value is around 9.5.

Bivariate Plots

ggplot(aes(x = factor(quality), y = residual.sugar ), data = Whitewine) + 
               geom_jitter(alpha = 0.1 ) + 
               geom_boxplot(alpha = 0.3, color = 'blue' ) + 
  stat_summary(fun.y = "mean",
               geom = "point", 
               color = "red",
               shape = 8,
               size = 4)

ggplot(aes(x = residual.sugar, y = quality), data = Whitewine) + 
  geom_jitter(alpha = 0.1, color = 'orange') + 
                xlim(0, quantile(Whitewine$residual.sugar, 0.99)) + 
                ylim(3, 9) + 
                geom_smooth()
## `geom_smooth()` using method = 'gam'
## Warning: Removed 47 rows containing non-finite values (stat_smooth).
## Warning: Removed 61 rows containing missing values (geom_point).

Quality has very high variance conditional on residual sugar. For very close values of residual sugar, quality changes a lot, which means the correlation is very low. However, wines with extreme residual sugar values have lower quality.

When residual.sugar is between 1.5 and 5, quality is at its best and the mean of means is highest.

Between 5 and 10, the variance in quality is very high and the quality mean reaches very high values. However, the mean of means is quite low.
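
The conditional means discussed above can also be computed directly instead of being read off the smoother. A minimal sketch on made-up numbers (the vectors and the bin edges are assumptions, chosen only for illustration) bins residual sugar with `cut()` and summarises quality within each bin using `tapply()`:

```r
# Made-up values standing in for Whitewine$residual.sugar and Whitewine$quality.
sugar   <- c(1, 2, 3, 6, 7, 8, 12, 15)
quality <- c(6, 7, 6, 5, 6, 7, 5, 5)

# Assumed bin edges: (0,5], (5,10], (10,20].
sugar_bin <- cut(sugar, breaks = c(0, 5, 10, 20))

# Mean and variance of quality within each sugar bin.
bin_means <- tapply(quality, sugar_bin, mean)
bin_vars  <- tapply(quality, sugar_bin, var)

print(round(bin_means, 2))
print(round(bin_vars, 2))
```

On the real data, `Whitewine$residual.sugar` and `Whitewine$quality` would replace the two toy vectors, and the bin edges would need to be chosen to match the ranges discussed in the text.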

ggplot(aes(x = factor(quality), y = alcohol ), data = Whitewine) + 
               geom_jitter(alpha = 0.1 ) + 
               geom_boxplot(alpha = 0.3, color = 'blue' ) + 
  stat_summary(fun.y = "mean",
               geom = "point", 
               color = "red",
               shape = 8,
               size = 4)

ggplot(aes(x = alcohol, y = quality), data = Whitewine) + 
  geom_jitter(alpha = 0.1)

ggplot(aes(x = alcohol, y = quality), data = Whitewine) + 
  geom_jitter(alpha = 0.1, color = 'orange') + 
                geom_smooth()
## `geom_smooth()` using method = 'gam'

The mean and the mean of means show a clear pattern in quality conditional on alcohol. If extreme values are trimmed:

ggplot(aes(x = alcohol, y = quality), data = Whitewine) + 
  geom_jitter(alpha = 0.1, color = 'orange') + 
                xlim(quantile(Whitewine$alcohol, 0.01), 
                     quantile(Whitewine$alcohol, 0.99)) + 
  geom_smooth()
## `geom_smooth()` using method = 'gam'
## Warning: Removed 78 rows containing non-finite values (stat_smooth).
## Warning: Removed 127 rows containing missing values (geom_point).

The trimmed plot shows a clearer positive linear relationship between 9.5 and 13. The best qualities are found at alcohol levels between 12 and 13.

with(Whitewine, cor.test(alcohol, quality, method = 'pearson'))
## 
##  Pearson's product-moment correlation
## 
## data:  alcohol and quality
## t = 33.858, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4126015 0.4579941
## sample estimates:
##       cor 
## 0.4355747
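
The t statistic in this output follows from the correlation and the degrees of freedom: t = r * sqrt(df) / sqrt(1 - r^2). A short check using only the numbers printed above:

```r
# Values copied from the cor.test() output above.
r  <- 0.4355747
df <- 4896

# t statistic for testing that a Pearson correlation is zero.
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

print(round(t_stat, 3))
```

The result agrees with the reported t = 33.858 up to rounding of the printed correlation.
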
ggplot(aes(x = factor(quality), y = volatile.acidity ), data = Whitewine) + 
               geom_jitter(alpha = 0.1 ) + 
               geom_boxplot(alpha = 0.3, color = 'blue' ) + 
  stat_summary(fun.y = "mean",
               geom = "point", 
               color = "red",
               shape = 8,
               size = 4)

ggplot(aes(x = volatile.acidity, y = quality), data = Whitewine) + 
  geom_jitter(alpha = 0.1, color = 'orange') + 
                geom_smooth()
## `geom_smooth()` using method = 'gam'

The graph shows a negative relationship between volatile acidity and quality. Let's investigate the extreme points.

ggplot(aes(x = volatile.acidity, y = quality), data = Whitewine) + 
  geom_jitter(alpha = 0.1, color = 'orange') + 
                xlim(quantile(Whitewine$volatile.acidity, 0.08), quantile(Whitewine$volatile.acidity, 0.99)) + 
  geom_smooth()
## `geom_smooth()` using method = 'gam'
## Warning: Removed 317 rows containing non-finite values (stat_smooth).
## Warning: Removed 388 rows containing missing values (geom_point).

Trimming the extreme high points decreased the slope; however, a negative relationship is still clearly visible. Above a volatile acidity of 0.5, the slope (relationship strength) increases.

with(Whitewine, cor.test(volatile.acidity, quality, method = 'pearson'))
## 
##  Pearson's product-moment correlation
## 
## data:  volatile.acidity and quality
## t = -13.891, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2215214 -0.1676307
## sample estimates:
##       cor 
## -0.194723
ggplot(aes(x = factor(quality), y = free.sulfur.dioxide ), data = Whitewine) + 
               geom_jitter(alpha = 0.1 ) + 
               geom_boxplot(alpha = 0.3, color = 'blue' ) + 
  stat_summary(fun.y = "mean",
               geom = "point", 
               color = "red",
               shape = 8,
               size = 4)

ggplot(aes(x = free.sulfur.dioxide, y = quality), data = Whitewine) + 
  geom_jitter(alpha = 0.1, color = 'orange') + 
                xlim(2, quantile(Whitewine$free.sulfur.dioxide,0.99)) + 
  geom_smooth()
## `geom_smooth()` using method = 'gam'
## Warning: Removed 43 rows containing non-finite values (stat_smooth).
## Warning: Removed 47 rows containing missing values (geom_point).

Quality and free.sulfur.dioxide have a positive relationship between 0 and 30. After 40, the mean of means decreases and falls to a quality level of 6.

ggplot(aes(x = factor(quality), y = total.sulfur.dioxide ), data = Whitewine) + 
               geom_jitter(alpha = 0.1 ) + 
               geom_boxplot(alpha = 0.3, color = 'blue' ) + 
  stat_summary(fun.y = "mean",
               geom = "point", 
               color = "red",
               shape = 8,
               size = 4)

ggplot(aes(x = total.sulfur.dioxide, y = quality), data = Whitewine) + 
  geom_jitter(alpha = 0.1, color = 'orange') + 
                xlim(quantile(Whitewine$total.sulfur.dioxide,0.01), quantile(Whitewine$total.sulfur.dioxide,0.99)) + 
  geom_smooth()
## `geom_smooth()` using method = 'gam'
## Warning: Removed 98 rows containing non-finite values (stat_smooth).
## Warning: Removed 99 rows containing missing values (geom_point).

The plot shows that total.sulfur.dioxide has a positive relationship with quality between 0 and 90. The slope becomes negative after 100, but the relationship is weak. The quality value is clearly stable between 75 and 150. For small values, quality is very volatile.

summary(Whitewine$total.sulfur.dioxide)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     9.0   108.0   134.0   138.4   167.0   440.0
ggplot(aes(x = factor(quality), y = chlorides ), data = Whitewine) + 
               geom_jitter(alpha = 0.1 ) + 
               geom_boxplot(alpha = 0.3, color = 'blue' ) + 
  stat_summary(fun.y = "mean",
               geom = "point", 
               color = "red",
               shape = 8,
               size = 4)

Chlorides values are mostly concentrated between 0 and 0.1. Let's take a look at the summary table:

summary(Whitewine$chlorides)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600

From the summary we can see that even the 3rd quartile is only 0.05. If we trim the extreme values and plot quality conditional on chlorides:

ggplot(aes(x = chlorides, y = quality), data = Whitewine) + 
  geom_jitter(alpha = 0.1, color = 'orange') + 
                xlim(0,quantile(Whitewine$chlorides,0.95)) + 
  geom_smooth()
## `geom_smooth()` using method = 'gam'
## Warning: Removed 237 rows containing non-finite values (stat_smooth).
## Warning: Removed 249 rows containing missing values (geom_point).

Quality is very stable between 0.025 and 0.075, with a negative relationship with chlorides. The volatility increases after 0.10.

ggplot(aes(x = factor(quality), y = sulphates ), data = Whitewine) + 
               geom_jitter(alpha = 0.1 ) + 
               geom_boxplot(alpha = 0.3, color = 'blue' ) + 
  stat_summary(fun.y = "mean",
               geom = "point", 
               color = "red",
               shape = 8,
               size = 4)

ggplot(aes(x = sulphates, y = quality), data = Whitewine) + 
  geom_jitter(alpha = 0.1, color = 'orange') + 
                xlim(quantile(Whitewine$sulphates,0.01), 
                     quantile(Whitewine$sulphates,0.99)) + 
  geom_smooth()
## `geom_smooth()` using method = 'gam'
## Warning: Removed 84 rows containing non-finite values (stat_smooth).
## Warning: Removed 96 rows containing missing values (geom_point).

It is difficult to see any relationship between quality and sulphates visually. There is just a small increase around 0.8.

Let's look at the relationship of the other variables (citric.acid, fixed.acidity, density and pH) with quality.

ggplot(aes(x = factor(quality), y = citric.acid ), data = Whitewine) + 
               geom_jitter(alpha = 0.1 ) + 
               geom_boxplot(alpha = 0.3, color = 'blue' ) + 
  stat_summary(fun.y = "mean",
               geom = "point", 
               color = "red",
               shape = 8,
               size = 4)

ggplot(aes(x = citric.acid, y = quality), data = Whitewine) + 
  geom_jitter(alpha = 0.1, color = 'orange') + 
                xlim(quantile(Whitewine$citric.acid,0.01), 
                     quantile(Whitewine$citric.acid,0.99)) + 
  geom_smooth()
## `geom_smooth()` using method = 'gam'
## Warning: Removed 68 rows containing non-finite values (stat_smooth).
## Warning: Removed 86 rows containing missing values (geom_point).

ggplot(aes(x = factor(quality), y = fixed.acidity ), data = Whitewine) + 
               geom_jitter(alpha = 0.1 ) + 
               geom_boxplot(alpha = 0.3, color = 'blue' ) + 
  stat_summary(fun.y = "mean",
               geom = "point", 
               color = "red",
               shape = 8,
               size = 4)

ggplot(aes(x = fixed.acidity, y = quality), data = Whitewine) + 
  geom_jitter(alpha = 0.1, color = 'orange') + 
                xlim(quantile(Whitewine$fixed.acidity,0.01), quantile(Whitewine$fixed.acidity,0.99)) + 
  geom_smooth()
## `geom_smooth()` using method = 'gam'
## Warning: Removed 75 rows containing non-finite values (stat_smooth).
## Warning: Removed 99 rows containing missing values (geom_point).

ggplot(aes(x = factor(quality), y = density ), data = Whitewine) + 
               geom_jitter(alpha = 0.1 ) + 
               geom_boxplot(alpha = 0.3, color = 'blue' ) + 
  stat_summary(fun.y = "mean",
               geom = "point", 
               color = "red",
               shape = 8,
               size = 4)

ggplot(aes(x = density, y = quality), data = Whitewine) + 
  geom_jitter(alpha = 0.1, color = 'orange') + 
                xlim(quantile(Whitewine$density,0.01), 
                     quantile(Whitewine$density,0.99)) + 
  geom_smooth()
## `geom_smooth()` using method = 'gam'
## Warning: Removed 98 rows containing non-finite values (stat_smooth).
## Warning: Removed 98 rows containing missing values (geom_point).

with(Whitewine, cor.test(density, quality))
## 
##  Pearson's product-moment correlation
## 
## data:  density and quality
## t = -22.581, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.3322718 -0.2815385
## sample estimates:
##        cor 
## -0.3071233
ggplot(aes(x = factor(quality), y = pH ), data = Whitewine) + 
               geom_jitter(alpha = 0.1 ) + 
               geom_boxplot(alpha = 0.3, color = 'blue' ) + 
  stat_summary(fun.y = "mean",
               geom = "point", 
               color = "red",
               shape = 8,
               size = 4)

ggplot(aes(x = pH, y = quality), data = Whitewine) + 
  geom_jitter(alpha = 0.1, color = 'orange') + 
                xlim(quantile(Whitewine$pH,0.01), 
                     quantile(Whitewine$pH,0.99)) + 
  geom_smooth()
## `geom_smooth()` using method = 'gam'
## Warning: Removed 85 rows containing non-finite values (stat_smooth).
## Warning: Removed 94 rows containing missing values (geom_point).

Quality does not seem to vary conditional on pH and fixed acidity. However, quality does seem to be related to density and citric acid. In particular, denser wines seem to have lower quality on average.
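
For the three variables whose correlation with quality is reported in this section, the strength of the linear relationship can be compared by sorting on the absolute value. A minimal sketch using only the printed estimates (citric acid, pH and fixed acidity are left out because no correlation is reported for them):

```r
# Pearson correlations with quality, copied from the cor.test() outputs above.
reported <- c(
  alcohol          =  0.4355747,
  volatile.acidity = -0.194723,
  density          = -0.3071233
)

# Order by strength of association, ignoring the sign.
ranked <- reported[order(abs(reported), decreasing = TRUE)]

print(ranked)
```

Alcohol comes first, followed by density and then volatile acidity.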

Let's check how the other variables behave conditional on alcohol.

ggplot(aes(x = alcohol, y = residual.sugar), data = Whitewine) + 
  geom_jitter(alpha = 0.1, color = 'orange') + 
                ylim(0,quantile(Whitewine$residual.sugar,0.95)) + 
  coord_trans(y = 'sqrt') + 
  geom_smooth()
## `geom_smooth()` using method = 'gam'
## Warning: Removed 240 rows containing non-finite values (stat_smooth).
## Warning: Removed 244 rows containing missing values (geom_point).

There is a decreasing trend in residual.sugar for alcohol levels between 8 and 10.

ggplot(aes(x = alcohol, y = volatile.acidity), data = Whitewine) + 
  geom_jitter(alpha = 0.1, color = 'orange') + 
                ylim(quantile(Whitewine$volatile.acidity,0.05), quantile(Whitewine$volatile.acidity,0.95)) + 
  coord_trans(y = 'sqrt') + 
  geom_smooth()
## `geom_smooth()` using method = 'gam'
## Warning: Removed 400 rows containing non-finite values (stat_smooth).
## Warning: Removed 446 rows containing missing values (geom_point).

From the plot we can observe that volatile.acidity increases for alcohol values above 11.

with(subset(Whitewine, alcohol > 11), 
     cor.test(volatile.acidity, alcohol, method = 'pearson'))
## 
##  Pearson's product-moment correlation
## 
## data:  volatile.acidity and alcohol
## t = 14.107, df = 1559, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2917104 0.3797308
## sample estimates:
##       cor 
## 0.3364553
ggplot(aes(x = alcohol, y = total.sulfur.dioxide), data = Whitewine) + 
  geom_jitter(alpha = 0.1, color = 'orange') + 
                ylim(quantile(Whitewine$total.sulfur.dioxide,0.05), quantile(Whitewine$total.sulfur.dioxide,0.95)) + 
  coord_trans(y = 'sqrt') + 
  geom_smooth()
## `geom_smooth()` using method = 'gam'
## Warning: Removed 474 rows containing non-finite values (stat_smooth).
## Warning: Removed 485 rows containing missing values (geom_point).

There is a decreasing trend in total.sulfur.dioxide as alcohol increases.

with(Whitewine, cor.test(alcohol, total.sulfur.dioxide, method = 'pearson'))
## 
##  Pearson's product-moment correlation
## 
## data:  alcohol and total.sulfur.dioxide
## t = -35.15, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4709775 -0.4262443
## sample estimates:
##        cor 
## -0.4488921
ggplot(aes(x = alcohol, y = density), data = Whitewine) + 
  geom_jitter(alpha = 0.1, color = 'orange') + 
                ylim(quantile(Whitewine$density,0.05), quantile(Whitewine$density,0.95)) + 
  coord_trans(y = 'sqrt') + 
  geom_smooth()
## `geom_smooth()` using method = 'gam'
## Warning: Removed 483 rows containing non-finite values (stat_smooth).
## Warning: Removed 491 rows containing missing values (geom_point).

with(Whitewine, cor.test(alcohol, density, method = 'pearson'))
## 
##  Pearson's product-moment correlation
## 
## data:  alcohol and density
## t = -87.255, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.7908646 -0.7689315
## sample estimates:
##        cor 
## -0.7801376

Density and alcohol have a strong negative relationship (-0.78), as seen in the graph and the correlation test above.
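The separate cor.test calls above can be complemented by computing the correlation of several columns with alcohol in one pass. The sketch below is illustrative only: the data frame sim is simulated, its coefficients are arbitrary, and none of the printed values correspond to the wine data.

```r
# Simulated columns that borrow the wine data's names; coefficients are arbitrary.
set.seed(1)
n <- 500
alcohol <- runif(n, 8, 14)
sim <- data.frame(
  alcohol = alcohol,
  density = 1.01 - 0.0015 * alcohol + rnorm(n, sd = 0.001),
  total.sulfur.dioxide = 220 - 8 * alcohol + rnorm(n, sd = 35)
)

# Pearson correlation of every other column with alcohol.
others <- setdiff(names(sim), "alcohol")
r <- sapply(others, function(v) cor(sim[[v]], sim$alcohol, method = "pearson"))
print(round(r, 2))

# The same test used in the text, kept as an object so its parts can be reused.
ct <- cor.test(sim$alcohol, sim$density, method = "pearson")
print(ct$estimate)  # point estimate
print(ct$conf.int)  # 95 percent confidence interval
```

Storing the cor.test result lets the estimate and confidence interval be quoted in the text without copying numbers from the console output.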

ggplot(aes(x = quality, y = alcohol), data = Whitewine) + 
  geom_boxplot(aes(group = quality))

High quality wines generally have high alcohol levels, as shown by the boxplot.

ggplot(aes(x = quality, y = residual.sugar), data = Whitewine) + 
  geom_boxplot(aes(group = quality)) + 
                ylim(quantile(Whitewine$residual.sugar,0.01), quantile(Whitewine$residual.sugar,0.99))
## Warning: Removed 81 rows containing non-finite values (stat_boxplot).

Low and high quality wines contain similar amounts of sugar, and the data points vary widely. It is difficult to trace a clear pattern.

ggplot(aes(x = quality, y = density), data = Whitewine) + 
  geom_boxplot(aes(group = quality)) + 
                ylim(0.9871,quantile(Whitewine$density,0.99))
## Warning: Removed 49 rows containing non-finite values (stat_boxplot).

High quality wines clearly have low density on average.

ggplot(aes(x = quality, y = volatile.acidity), data = Whitewine) + 
  geom_boxplot(aes(group = quality)) + 
                ylim(quantile(Whitewine$volatile.acidity,0.01), quantile(Whitewine$volatile.acidity,0.99))
## Warning: Removed 82 rows containing non-finite values (stat_boxplot).

It is difficult to detect a relationship between quality and volatile acidity when extremes are trimmed.

ggplot(aes(x = quality, y = chlorides), data = Whitewine) + 
  geom_boxplot(aes(group = quality)) + 
                ylim(quantile(Whitewine$chlorides,0.01), 
                     quantile(Whitewine$chlorides,0.99))
## Warning: Removed 88 rows containing non-finite values (stat_boxplot).

ggplot(aes(x = quality, y = total.sulfur.dioxide), data = Whitewine) + 
  geom_boxplot(aes(group = quality)) + 
                ylim(quantile(Whitewine$total.sulfur.dioxide,0.01), quantile(Whitewine$total.sulfur.dioxide,0.99))
## Warning: Removed 98 rows containing non-finite values (stat_boxplot).

ggplot(aes(x = quality, y = free.sulfur.dioxide), data = Whitewine) + 
  geom_boxplot(aes(group = quality)) + 
                ylim(quantile(Whitewine$free.sulfur.dioxide,0.01), quantile(Whitewine$free.sulfur.dioxide,0.99))
## Warning: Removed 90 rows containing non-finite values (stat_boxplot).

ggplot(aes(x = quality, y = citric.acid), data = Whitewine) + 
  geom_boxplot(aes(group = quality)) + 
                ylim(quantile(Whitewine$citric.acid,0.01), 
                     quantile(Whitewine$citric.acid,0.99))
## Warning: Removed 68 rows containing non-finite values (stat_boxplot).

Free sulfur dioxide is at similar levels across quality levels. Wines of different quality also have similar amounts of citric acid. However, the highest quality wines (quality 9) have a noticeably higher amount of citric acid.

ggplot(aes(x = quality, y = sulphates), data = Whitewine) + 
  geom_boxplot(aes(group = quality)) + 
                ylim(quantile(Whitewine$sulphates,0.01), 
                     quantile(Whitewine$sulphates,0.99))
## Warning: Removed 84 rows containing non-finite values (stat_boxplot).

ggplot(aes(x = quality, y = fixed.acidity), data = Whitewine) + 
  geom_boxplot(aes(group = quality)) + 
                ylim(quantile(Whitewine$fixed.acidity,0.01), quantile(Whitewine$fixed.acidity,0.99))
## Warning: Removed 75 rows containing non-finite values (stat_boxplot).

Different qualities have quite similar amounts of sulphates and fixed acidity.

Bivariate Analysis

Bivariate analysis shows that, other than alcohol, none of the variables has a direct linear relationship with quality. However, some variables are related to quality and to each other. Both high and low quality wines can be observed for different mixtures of inputs.
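The boxplot comparisons behind this summary can also be checked numerically by tabulating the median of each variable per quality score, since the median is the middle line of each box. The following is a minimal sketch on simulated data; the data frame sim and its trend coefficients are assumptions made for illustration, not values from the wine data.

```r
# Simulated data: alcohol is given an arbitrary upward trend with quality,
# while pH is generated independently of quality.
set.seed(7)
n <- 600
quality <- sample(3:9, n, replace = TRUE)
sim <- data.frame(
  quality = quality,
  alcohol = 9 + 0.4 * (quality - 3) + rnorm(n, sd = 0.8),
  pH = rnorm(n, mean = 3.2, sd = 0.15)
)

# Median of each variable within each quality score.
medians <- aggregate(cbind(alcohol, pH) ~ quality, data = sim, FUN = median)
print(medians)
```

A column whose medians drift steadily across quality scores suggests a relationship, while a flat column matches the "similar amounts" observations made from the boxplots.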

observed relationships

In the analysis, alcohol, volatile acidity and residual sugar were the primary features of interest.

Alcohol was found to have a positive correlation (0.43) with quality. The smoothed curve in the graph shows quality increasing with alcohol. In the box plot by quality, high quality wines were observed to have higher alcohol values.

Volatile acidity was found to have a negative correlation with quality. The relationship is easier to see when smoothed. The box plot shows that high quality levels generally have lower volatile acidity.

However, the expected strong relationship between residual sugar and quality was not found. The only observation is that low sugar levels occur in both high and low quality wines, while mid quality wines include high residual sugar levels.

Ques:

Did you observe any interesting relationships between the other features?

Ans:

Although density was not a primary feature of interest, it was found to have a negative relationship with quality and a significant correlation. From the box plot, it can be observed that high quality wines are less dense when extremes are trimmed.

Ques:

What was the strongest relationship found?

Ans:

The strongest relationship was found between alcohol and density (correlation -0.78).


Multivariate Plots

Whitewine$quality_grouped <- cut(Whitewine$quality, c(2,4,7,9))
ggplot(aes(x = alcohol, y = residual.sugar), data = Whitewine) + 
  geom_point(aes(color = quality_grouped), 
             stat = 'summary', fun.y = mean) + 
  scale_color_brewer(type = 'seq', 
                     guide=guide_legend(title = 'quality_grouped'))

When the alcohol and residual.sugar plot is colored by quality_grouped, the high quality wines generally have high alcohol levels. The darker blue points show the high quality wines.
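To make the grouping explicit, the short sketch below applies the same breaks to the quality scores 3 to 9. By default cut() builds intervals that are open on the left and closed on the right, so scores 3 and 4 fall in (2,4], scores 5 to 7 in (4,7], and scores 8 and 9 in (7,9]. The labels "low", "medium" and "high" are a hypothetical addition for a more readable legend and are not used elsewhere in this analysis.

```r
quality <- 3:9

# Same breaks as in the text.
grouped <- cut(quality, breaks = c(2, 4, 7, 9))
print(data.frame(quality, grouped))

# Optional, hypothetical labels for the three groups.
labelled <- cut(quality, breaks = c(2, 4, 7, 9),
                labels = c("low", "medium", "high"))
print(table(labelled))
```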

ggplot(aes(x = alcohol, y = residual.sugar, 
           color = factor(quality)), data = Whitewine) + 
  geom_point(alpha = 0.8, size = 1) + 
  ylim(quantile(Whitewine$residual.sugar, 0.01), 
       quantile(Whitewine$residual.sugar, 0.99)) + 
             geom_smooth(method = "lm", se = FALSE, size=1) + 
  scale_color_brewer(type = 'seq', 
                     guide=guide_legend(title = 'Quality'))
## Warning: Removed 81 rows containing non-finite values (stat_smooth).
## Warning: Removed 81 rows containing missing values (geom_point).
## Warning: Removed 48 rows containing missing values (geom_smooth).

An important implication of the graph is that high quality wines generally have low residual.sugar levels (less than 5). However, a low level of sugar does not imply a high quality wine.

Whitewine$quality_grouped <- cut(Whitewine$quality, c(2,4,7,9))
ggplot(aes(x = alcohol, y = density), data = Whitewine) + 
  geom_point(aes(color = quality_grouped), 
             stat = 'summary', fun.y = mean) + 
  scale_color_brewer(type = 'seq', guide=guide_legend(title = 'quality_grouped'))

ggplot(aes(x = alcohol, y = density, 
           color = factor(quality)), data = Whitewine) + 
  geom_point(alpha = 0.8, size = 1) + 
  ylim(quantile(Whitewine$density, 0.01), 
       quantile(Whitewine$density, 0.99)) + 
             geom_smooth(method = "lm", se = FALSE, size=1) + 
  scale_color_brewer(type = 'seq', 
                     guide=guide_legend(title = 'Quality'))
## Warning: Removed 98 rows containing non-finite values (stat_smooth).
## Warning: Removed 98 rows containing missing values (geom_point).
## Warning: Removed 23 rows containing missing values (geom_smooth).

From the above plot it can be observed that most low quality wines have low alcohol and high density, and most high quality wines have low density and high alcohol. The medium quality wines are spread all over the graph.

ggplot(aes(x = alcohol, y = citric.acid), data = Whitewine) + 
  ylim(quantile(Whitewine$citric.acid,0.01), 
       quantile(Whitewine$citric.acid, 0.99)) + 
  geom_point(aes(color = quality_grouped), 
             stat = 'summary', fun.y = mean) + 
  scale_color_brewer(type = 'seq', 
                     guide=guide_legend(title = 'quality_grouped'))
## Warning: Removed 68 rows containing non-finite values (stat_summary).

ggplot(aes(x = alcohol, y = citric.acid, 
           color = factor(quality)), data = Whitewine) + 
  geom_point(alpha = 0.8, size = 1) + 
  ylim(quantile(Whitewine$citric.acid, 0.01), 
       quantile(Whitewine$citric.acid, 0.99)) + 
             geom_smooth(method = "lm", se = FALSE, size=1) + 
  scale_color_brewer(type = 'seq', 
                     guide=guide_legend(title = 'Quality'))
## Warning: Removed 68 rows containing non-finite values (stat_smooth).
## Warning: Removed 68 rows containing missing values (geom_point).

summary(Whitewine$citric.acid)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.2700  0.3200  0.3342  0.3900  1.6600
summary(Whitewine$alcohol)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.50   10.40   10.51   11.40   14.20

High quality wines clearly tend to have more than the median citric acid level. From the summary, we can see that high quality wines are distributed between the median and 3rd quartile citric.acid values. Hardly any high quality wines are produced with low citric acid values.

Whitewine$alcohol_grouped <- cut(Whitewine$alcohol, 
                                 c(7.50,9,10.50,12,16))
ggplot(aes(x = factor(quality), y = volatile.acidity), 
       data = Whitewine) + 
  geom_boxplot(aes(fill = alcohol_grouped)) +  
  scale_fill_brewer(type = 'seq', 
                    guide=guide_legend(title = 'alcohol_grouped'))

The plot shows that high quality wines have high alcohol levels and low volatile acidity. An important inference is that the difference between alcohol groups becomes more visible as volatile acidity increases.

# x is set to factor(quality), matching the neighbouring plots and the
# quality-based interpretation below.
ggplot(aes(x = factor(quality), y = chlorides), data = Whitewine) + 
  ylim(quantile(Whitewine$chlorides, 0.01), 
       quantile(Whitewine$chlorides, 0.99)) + 
  geom_boxplot(aes(fill = alcohol_grouped)) + 
  scale_fill_brewer(type = 'seq', 
                    guide=guide_legend(title = 'alcohol_grouped'))
## Warning: Removed 88 rows containing non-finite values (stat_boxplot).

There exists a negative relationship between quality and chlorides. Low chlorides together with a high alcohol level indicate good quality.

ggplot(aes(x = factor(quality), 
           y = total.sulfur.dioxide), data = Whitewine) + 
  geom_boxplot(aes(fill = alcohol_grouped)) +  
  scale_fill_brewer(type = 'seq', 
                    guide=guide_legend(title = 'alcohol_grouped'))

High quality wines are generally concentrated below a total.sulfur.dioxide value of 150.

Multivariate Analysis

In the multivariate analysis, quality and alcohol values are grouped and converted to factors for better visualization.

Observed relationships in this part:

High quality wines had high alcohol, low residual sugar, low density, high citric acid, low chlorides, low sulfur dioxide and low volatile acidity.

The third dimension improved the quality of the visualization and pattern detection. Since the relationship was non-linear, grouping the alcohol and quality values provided further insights.

The medium quality wines are dispersed all over the graph. Therefore, further analysis should be conducted to examine them in detail.

Ques:

Were there any interesting interactions between features?

Ans:

Alcohol and residual sugar interactions are surprising.


Final Plots And Summary

Plot 1:

ggplot(aes(x = quality, y = alcohol), data = Whitewine) + 
  geom_boxplot(aes(group = quality)) + 
  ggtitle('Alcohol Wine Quality Box Plot') + 
  labs(y = "alcohol (% by volume)", 
x = "quality(score between 0 and 10)")

Plot Description

Alcohol and quality have a positive relationship (above a quality of 5), and the relationship is very close to linear. Although there are fewer data points, there is a negative relationship between quality values of 3 and 5. However, when extremes are trimmed as in the first graph, it is easier to observe the trend.

Plot 2:

ggplot(aes(x = factor(quality), 
           y = volatile.acidity), data = Whitewine) + 
  geom_boxplot(aes(fill = alcohol_grouped)) + 
  scale_fill_brewer(type = 'seq', 
                    guide=guide_legend(title = 'alcohol_grouped')) + 
  ggtitle('Wine Quality Volatile Acidity by Alcohol Graph') + 
  # Axis labels match the aesthetics: quality on x, volatile acidity on y.
  labs(x = "quality (score between 0 and 10)", 
       y = "volatile acidity (acetic acid - g / dm^3)")

Plot Description

The graph shows that there is a negative relationship between volatile acidity and quality. We can also observe that high quality wines have high alcohol levels. Furthermore, the separation between alcohol groups increases at high volatile acidity.

Plot 3:

ggplot(aes(x = alcohol, y = residual.sugar), data = Whitewine) + 
  geom_point(alpha = 0.1, position = position_jitter(h=0), color = 'orange') + 
  ylim(0,quantile(Whitewine$residual.sugar, 0.95)) + 
  coord_trans(y = 'sqrt') + 
             geom_smooth() + 
  ggtitle('Alcohol Residual Sugar Graph') + 
   labs(x = "Alcohol(% by volume)", 
y = "Residual Sugar(g / dm^3)")
## `geom_smooth()` using method = 'gam'
## Warning: Removed 240 rows containing non-finite values (stat_smooth).
## Warning: Removed 240 rows containing missing values (geom_point).

Plot Description:

The negative relationship between alcohol and residual sugar is detected. Although the variance is quite high, the smoothing curve shows the average residual sugar by alcohol. It is interesting to see that residual.sugar decreases markedly as alcohol increases.


Reflections

This analysis was conducted to explore the features of white wines and the relationships among them. The main purpose was to identify feature values in high quality and low quality wines. The analysis helped to extract the main features of high quality and low quality wines. The middle quality wines generally do not have extreme values, although they are dispersed all over the graph.

Box plots of features by quality value help to detect small differences among groups, which were nearly impossible to see in point and line graphs.

Some wines with different citric acid, chlorides, residual sugar and density levels nevertheless share the same high or low quality values. Even so, common properties of high quality wines can be extracted from this analysis, and any company can use them.

Some limitations made the interpretations difficult. The 10 point scale may be one limitation: the wines are compressed between quality 3 and 9.

There were no 10 point wines and none rated 1 or 2, which made most of the wines middle quality. This may be due to ceiling and floor effects in the quality ratings.